# Load packages
library(here)
library(tidyverse)
library(gt)
library(e1071)
library(scales)
library(corrplot)
library(caret)
library(randomForest)
library(glmnet)
library(gbm)
# Read in data
data <- read_csv(here("inputs//data_prep.csv"))
data <- data %>%
mutate(room_type = as.factor(room_type), license = as.factor(license))
For this analysis we’re explore the listings data of Airbnb rentals in Toronto. The data can be found at link
We’re primarily interested in the rental price however we’ll explore the entire dataset for anything interesting and visualize the results. The analysis will conclude with a model for predicting rental prices.
We’ve previously removed all listings with no reviews, we also remove all listings with no availability within the next year as these are likely no longer actively being rented and removed NA columns from the data.
We’ll generate some initial summary statistics of the various predictors to get started:
# Check Summary Statistics
head(data)
Some initial things to note from the summary statistics is that the vast majority of listings(67.7%) are entire homes or apartments as opposed to shared living spaces. Private rooms make up the bulk of the remainder at 31.5%.
data %>%
select(room_type) %>%
group_by(room_type) %>%
summarize(listings = n()) %>%
ungroup() %>%
mutate(room_type = fct_reorder(room_type, listings)) %>%
ggplot(aes(x = room_type, y = listings, fill = room_type)) +
geom_bar(stat = "identity") +
theme(legend.position = "none") +
xlab("Room Type") +
ylab("Listings")
Like any other residential property the neighbourhood is a likely predictor of the price of the property being rented so next be look at the distribution of
Lets now look at the distribution of rental prices:
ggplot(data, aes(x = price)) +
geom_histogram(binwidth = 10, fill = "blue", color = "black") +
labs(title = "Distribution of Airbnb Rental Prices",
x = "Price",
y = "Count")
I’ve identified the major outlier in this case to be this listing. To get a better idea we can check for example how many listings are there above a price of $5000?
dim(data %>%
filter(price > 5000))[1]
## [1] 3
We find there are only 13 listings. We can then look at a boxplot of the remaining listings after removing the ones above 5000.
data %>%
filter(price < 5000) %>%
ggplot(aes(x = price)) +
geom_boxplot(color = "blue", outlier.color = "red", outlier.size = 2) +
scale_x_continuous(breaks = c(0,1000,2000, 3000, 4000, 5000)) +
ylim(-4,4) +
stat_boxplot(geom ='errorbar') +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank())
The data seems to get much more spread out above a price of $1500, so we’ll focus in on those data points:
data %>%
filter(price < 1500) %>%
ggplot(aes(x = price)) +
geom_boxplot(color = "blue", outlier.color = "red", outlier.size = 2) +
scale_x_continuous(breaks = seq(from = 0, to = 1500, by = 250)) +
ylim(-4,4) +
stat_boxplot(geom ='errorbar') +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank())
This seems like a better representation of most of the price data for the listings.
As with any property value we may expect things like the number of bedrooms in the listing to be a predictor of price. Larger properties should have higher bedroom counts and thus be more expensive to rent:
data %>%
filter(price < 1500) %>%
mutate(bedrooms = as.factor(bedrooms)) %>%
select(price, bedrooms) %>%
group_by(bedrooms) %>%
summarize(mean_price = mean(price)) %>%
ungroup() %>%
ggplot(aes(x = bedrooms, y = mean_price, fill = bedrooms)) +
geom_bar(stat = "identity") +
scale_y_continuous(breaks = seq(from = 0, to = 1500, by = 250)) +
theme(legend.position = "none") +
xlab("# of Bedrooms") +
ylab("Mean Price")
There appears to be only 1 listing with 9 bedrooms.
data %>%
filter(price < 1500 & bedrooms == 9)
Because the price is so wildly off from the overall trend I think it’s best to remove this point in the model.
Next it seems likely review scores should be a strong predictor of prices:
data %>%
filter(price < 1500 & bedrooms < 9 & number_of_reviews > 5) %>%
ggplot(aes(x = review_scores_rating, y = price)) +
geom_point(alpha = 0.5, color = "#43a2ca") +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Price vs. Review Scores Rating",
x = "Review Scores",
y = "Price")
We can similarly do some quick visualizations to see the relationships among other predictors and the price variable:
data %>%
filter(price < 1500 & bedrooms < 9 & number_of_reviews > 5) %>%
ggplot(aes(x = minimum_nights)) + geom_histogram(binwidth = 10, fill = "blue", color = "black") +
labs(title = "Distribution of Minimun Nights",
x = "Nights",
y = "Count")
Clearly we have a pretty wide tail to the right on this distribution, suggesting that there are again a few extreme outliers in this data.
data %>%
filter(price < 1500 & bedrooms < 9 & number_of_reviews > 5) %>%
select(minimum_nights) %>%
group_by(minimum_nights) %>%
summarize(n = n()) %>%
arrange(desc(n))
data %>%
filter(price < 1500 & bedrooms < 9 & number_of_reviews > 5 & minimum_nights > 1000)
Interestingly by far the most common minimum nights selection on AirBNB is full month stays. Followed by 1,2,3 nights. The outliers we’re seeing in minimum nights largely seem to be inactive listings, with last reviews here for example being from 2016. We would like to remove listings like this but instead of making a cut off for minimum nights we will seek to do this by last review date. This ensures that the pricing we’re seeing in the data is currently active listings.
data %>%
filter(price < 1500 & bedrooms < 9 & number_of_reviews > 5) %>%
ggplot(aes(x = last_review)) + geom_histogram(binwidth = 10, fill = "blue", color = "black") +
labs(title = "Distribution of Latest Review Date",
x = "Last Review Date",
y = "Count")
As we can see the vast majority of these listings are recent, however we are getting a stretch of listings dating back all the way to 2015. For the sake of ensuring the pricing is accurate especially considering the price disturbances caused during Covid, we’ll remove listings who have not received a review after 2020.
data %>%
filter(price < 1500 & bedrooms < 9 & number_of_reviews > 5 & last_review >= '2020-01-01')